Detecting Duplicate Posts in Programming QA Communities via Latent Semantics and Association Rules

Authors

  • Wei Emma Zhang
  • Quan Z. Sheng
  • Jey Han Lau
  • Ermyas Abebe
Abstract

Programming community-based question-answering (PCQA) websites such as Stack Overflow enable programmers to find working solutions to their questions. Despite detailed posting guidelines, users frequently create duplicates of questions that have already been answered. To tackle this problem, Stack Overflow provides a mechanism for reputable users to manually mark duplicate questions. This is a laborious effort, and it leaves many duplicate questions undetected. Existing duplicate detection methodologies from traditional community-based question-answering (CQA) websites are difficult to adopt directly for PCQA, as PCQA posts often contain source code, which is linguistically very different from natural language. In this paper, we propose a methodology designed for the PCQA domain to detect duplicate questions. We model the detection as a classification problem over question pairs. To extract features for question pairs, our methodology leverages continuous word vectors from the deep learning literature, topic model features, and phrase pairs that co-occur frequently in duplicate questions, mined using machine translation systems. These features capture semantic similarities between questions and produce strong performance for duplicate detection. Experiments on a range of real-world datasets demonstrate that our method works very well, in some cases improving on state-of-the-art benchmarks by over 30%. As a product of one of the proposed features, the association score feature, we have mined a set of associated phrases from duplicate questions on Stack Overflow and released the dataset to the public.
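
The abstract frames detection as classification over question pairs, with each pair described by similarity features. The sketch below is only an illustration of that framing, not the authors' implementation: the toy word vectors, the association table, and the names `pair_features`, `mean_vector`, and `association_score` are all hypothetical, the topic model features are omitted, and no classifier is trained.

```python
import math

# Toy 2-dimensional word vectors (assumption); a real system would load
# trained embeddings instead.
TOY_VECTORS = {
    "reverse": [1.0, 0.0],
    "invert": [0.9, 0.1],
    "list": [0.0, 1.0],
    "array": [0.1, 0.9],
    "string": [0.5, 0.5],
}

# Hypothetical table of word pairs assumed to co-occur in known duplicates.
TOY_ASSOCIATIONS = {("reverse", "invert"), ("list", "array")}


def mean_vector(tokens):
    """Average the vectors of the tokens that appear in TOY_VECTORS."""
    known = [TOY_VECTORS[t] for t in tokens if t in TOY_VECTORS]
    if not known:
        return [0.0, 0.0]
    return [sum(dim) / len(known) for dim in zip(*known)]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def association_score(tokens_a, tokens_b):
    """Fraction of cross-question token pairs found in the association table."""
    pairs = [(a, b) for a in tokens_a for b in tokens_b]
    if not pairs:
        return 0.0
    hits = sum(
        1
        for a, b in pairs
        if (a, b) in TOY_ASSOCIATIONS or (b, a) in TOY_ASSOCIATIONS
    )
    return hits / len(pairs)


def pair_features(question_a, question_b):
    """Build a small feature dictionary for one question pair."""
    tokens_a = question_a.lower().split()
    tokens_b = question_b.lower().split()
    return {
        "vector_similarity": cosine(mean_vector(tokens_a), mean_vector(tokens_b)),
        "association_score": association_score(tokens_a, tokens_b),
    }
```

In a full pipeline, feature dictionaries like these would be computed for labeled duplicate and non-duplicate pairs and passed to a binary classifier; which classifier and which exact feature definitions the paper uses are not stated in the abstract.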

Similar resources

Declarative Semantics in Object-Oriented Software Development - A Taxonomy and Survey

One of the modern paradigms to develop an application is object oriented analysis and design. In this paradigm, there are several objects and each object plays some specific roles in applications. In an application, we must distinguish between procedural semantics and declarative semantics for their implementation in a specific programming language. For the procedural semantics, we can write a ...

Towards Building Open Knowledge Base From Programming Question-Answering Communities

In this paper, we propose the first system, so-called Open Programming Knowledge Extraction (OPKE), to automatically extract knowledge from programming Question-Answering (QA) communities. OPKE is the first step of building a programming-centric knowledge base. Data mining and Natural Language Processing techniques are leveraged to identify duplicate questions and construct structured informati...

Detecting Redundant reduction

We present a general method for detecting redundant production rules based upon a term rewrite semantics. We present the semantic account, define rule execution over both ground memories and memory schemas, and define redundancy for production rules. From those definitions, an algorithm is developed that detects redundant rules, and which improves upon previously

Sift through Online Programming Discussions: Effective Search-Result Navigation via Interactive Visualization

Online programming discussion forums are widely used by programmers for troubleshooting and various problem-solving tasks. The large and ever-increasing volume of posts in these communities demands more effort to read and comprehend, making it harder to find relevant information. In this paper we designed and studied an interactive network visualization to represent relevant search results for...

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, Web service is one of the most important services on the Internet, which has had the greatest impact on the generalization of the Internet in human societies. Internet penetration has been an effective factor in the growth of the volume of information on the Web. The massive growth of informat...

Publication date: 2017